Data Wrangling for Risk Clustering & Outcome Variables

## Data Wrangling ## 

#1. Create a version of the data containing baseline risk variables and longitudinal outcome variables for clustering EDA
clustering_eda_data <- full_join(risk_variable_data, outcome_variable_data) %>% 
  dplyr::select(-site, -age, -race_ethnicity, -sex, -family_id)

#2. Display distribution of data
clustering_eda_data %>% 
  skimr::skim()
Data summary
Name Piped data
Number of rows 83378
Number of columns 26
_______________________
Column type frequency:
character 2
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
participant_id 0 1 12 12 0 11878 0
session_id 0 1 7 7 0 9 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mh_p_cbcl__dsm__dep_tscore 16349 0.80 54.71 6.58 50.00 50.00 51.00 57.00 96.00 ▇▂▁▁▁
mh_p_cbcl__dsm__anx_tscore 15888 0.81 54.11 6.37 50.00 50.00 51.00 55.00 100.00 ▇▁▁▁▁
mh_p_cbcl__synd__attn_tscore 2944 0.96 53.47 5.66 50.00 50.00 51.00 55.00 100.00 ▇▁▁▁▁
mh_p_cbcl__synd__aggr_tscore 2938 0.96 52.18 4.79 50.00 50.00 50.00 52.00 100.00 ▇▁▁▁▁
mh_p_gbi_sum 74185 0.11 1.22 2.65 0.00 0.00 0.00 1.00 28.00 ▇▁▁▁▁
mh_y_upps__nurg_sum 74185 0.11 8.48 2.63 4.00 7.00 8.00 10.00 16.00 ▆▇▇▂▁
mh_y_upps__purg_sum 74185 0.11 7.94 2.93 4.00 6.00 8.00 10.00 16.00 ▇▆▆▂▁
le_l_coi__addr1__coi__total__national_zscore 74185 0.11 0.01 0.03 -0.11 -0.01 0.02 0.04 0.08 ▁▂▅▇▅
fc_p_nsc__ns_mean 74185 0.11 3.92 0.95 1.00 3.33 4.00 4.67 5.00 ▁▁▃▅▇
sds_total 74185 0.11 36.31 8.05 26.00 31.00 34.00 40.00 126.00 ▇▁▁▁▁
family_history_depression 74185 0.11 0.31 0.46 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▃
family_history_mania 74185 0.11 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
bullying 74185 0.11 0.25 0.44 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▃
nc_y_nihtb__lswmt__uncor_score 74185 0.11 97.04 11.95 36.00 90.00 97.00 105.00 136.00 ▁▁▅▇▁
nc_y_nihtb__flnkr__uncor_score 74185 0.11 94.19 9.02 51.00 90.00 96.00 100.00 116.00 ▁▁▃▇▂
nc_y_nihtb__pttcp__uncor_score 74185 0.11 88.36 14.48 30.00 80.00 88.00 99.00 140.00 ▁▂▇▅▁
ACE_index_sum_score 74185 0.11 1.95 1.35 0.00 1.00 2.00 3.00 7.00 ▇▅▆▁▁
si_passive 1839 0.98 0.09 0.28 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
si_active 1839 0.98 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
sa 1839 0.98 0.02 0.14 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
nssi 1839 0.98 0.07 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
bipolar_I 40868 0.51 0.01 0.10 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
bipolar_II 40868 0.51 0.00 0.06 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
any_bsd 40868 0.51 0.08 0.27 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁

Euclidean Separability of Risk Variables - is Clustering Appropriate?

Before employing clustering methods to identify meaningful latent groupings related to bipolar disorder and suicidality outcomes, it is critical to visually assess whether baseline risk variables demonstrate meaningful Euclidean separability. In other words, we must verify whether the baseline risk variables, both continuous (e.g., CBCL DSM-5 depression scores) and binary (e.g., family history of depression), form distinct, visually identifiable groups when plotted in pairs and colored according to longitudinal outcomes across assessment timepoints (baseline through 6-year follow-up).

This initial visual evaluation is foundational, as it determines whether clustering methods, which rely heavily on Euclidean or similar distance measures, are suitable for capturing latent risk groups that meaningfully predict clinical outcomes.

We will thus generate bi-plots of the 5 baseline risk variables most strongly associated (assessed via point-biserial correlation) with each outcome variable (i.e., binary diagnostic outcomes and continuous CBCL scores), with points color-coded by outcome, at every assessment timepoint for which that outcome is available.
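The selection step described above can be sketched as follows. This is a minimal illustration on simulated data, not the analysis code: the data frame `toy`, its column names, and the helper `top_k_by_point_biserial()` are all hypothetical. The point-biserial correlation is the Pearson correlation between a 0/1 variable and a continuous one, so base R's `cor()` is enough.

```r
# Hypothetical toy data: six risk features and one binary outcome
set.seed(1)
n <- 500
outcome <- rbinom(n, 1, 0.2)
toy <- data.frame(
  outcome = outcome,
  risk_a = rnorm(n, mean = 0.8 * outcome),
  risk_b = rnorm(n, mean = 0.5 * outcome),
  risk_c = rnorm(n, mean = 0.3 * outcome),
  risk_d = rnorm(n),
  risk_e = rnorm(n),
  risk_f = rnorm(n)
)

# Rank features by |point-biserial r| with a 0/1 outcome and keep the top k
top_k_by_point_biserial <- function(data, outcome_col, k = 5) {
  features <- setdiff(names(data), outcome_col)
  r <- vapply(
    features,
    function(f) cor(data[[f]], data[[outcome_col]], use = "pairwise.complete.obs"),
    numeric(1)
  )
  head(r[order(abs(r), decreasing = TRUE)], k)
}

top5 <- top_k_by_point_biserial(toy, "outcome", k = 5)
print(round(top5, 2))

# Pairwise scatterplots of the selected features, colored by outcome
pairs(
  toy[names(top5)],
  col = ifelse(toy$outcome == 1, "red", "blue"),
  pch = 20
)
```

For a continuous outcome such as a CBCL T-score, the same `cor()` call returns an ordinary Pearson correlation, and the points would need a color gradient or binned colors rather than two colors.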

Follow-Up Tests to Further Assess Optimization Feature Separability Before Landing on Clustering

Before committing to any clustering solution, we want to assess further (beyond the bi-plots) whether our 16 baseline risk variables can meaningfully distinguish outcome groups in Euclidean space. Two questions raised by Matt Sullivan guide this process:

  1. Univariate separability: Does a single feature (e.g., Sleep Disturbance) alone already classify “No” vs “Yes” for diagnoses of interest?

  2. Feature ranking: Which variables show the strongest separation and warrant emphasis?

1. Sleep Disturbance Univariate Separability Test

Goal

Demonstrate on a concrete example (i.e., baseline Sleep Disturbance vs. bipolar_I at the 6-year follow-up) whether a single feature can already separate the outcome groups. If it does:

  • Formal test confirms group means differ

  • Two modes become the natural 1-D cluster centroids

Why bipolar_I @ 6Y? I picked this wave and outcome because bipolar_I at year 6 is our longest-term key clinical endpoint with good sample size and data quality. “Success” here means that Sleep Disturbance has predictive value for future mania onset, which is exactly the kind of univariate signal clustering would leverage.

Steps & Rationale

  1. Normality check → test choice
- Shapiro–Wilk in each group → if both p > .05, use a Welch t-test; otherwise use Mann–Whitney U

- Cohen’s d quantifies the effect size (how far apart the “No” and “Yes” means are, in pooled-SD units)

  2. One normal vs two normals?
- Fit Gaussian mixture models (GMMs) with 1, 2, and 3 components to the pooled Sleep Disturbance scores

- ΔBIC + likelihood-ratio test tell us whether a second component is statistically justified

- When the two components are well separated, the two-component GMM’s means lie close to where a 1-D k-means (k = 2) algorithm would place its centroids

  3. Connection to clustering
- In 1-D, k-means centroids land near the two modes when the modes are well separated; the k-means decision boundary is the midpoint between the centroids

- Thus, if Sleep Disturbance alone cleanly separates “No” vs “Yes,” it already behaves as a near-perfect univariate classifier

Note that the same workflow (Shapiro–Wilk → test → Cohen’s d → GMM → 1-D centroids) can be applied to any other risk feature and any other follow-up outcome or timepoint.
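The first step of this workflow (normality check, test choice, effect size) might look like the sketch below. It runs on simulated scores rather than the study data; the vectors `score_no` and `score_yes` and the helper `cohens_d()` are hypothetical names, and the pooled-SD form of Cohen's d is my assumption about which variant is intended.

```r
# Simulated stand-in for baseline Sleep Disturbance scores by later outcome
set.seed(42)
score_no  <- rnorm(300, mean = 36, sd = 8)
score_yes <- rnorm(60,  mean = 39, sd = 8)

# Cohen's d with a pooled standard deviation (first group minus second group)
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / s_pooled
}

# Step 1: Shapiro-Wilk in each group (shapiro.test() accepts 3 to 5000 values)
normal_no  <- shapiro.test(score_no)$p.value  > 0.05
normal_yes <- shapiro.test(score_yes)$p.value > 0.05

# Step 2: Welch t-test if both groups look normal, otherwise Mann-Whitney U
if (normal_no && normal_yes) {
  test <- t.test(score_yes, score_no)       # var.equal = FALSE by default (Welch)
} else {
  test <- wilcox.test(score_yes, score_no)  # Mann-Whitney U / Wilcoxon rank-sum
}

# Step 3: effect size
d <- cohens_d(score_yes, score_no)
cat(test$method, "\n")
cat("p-value:", signif(test$p.value, 3), "| Cohen's d:", round(d, 2), "\n")
```

One practical caveat: `shapiro.test()` refuses samples larger than 5000, so if a group in the real data exceeds that size, a random subsample or a Q-Q plot would have to stand in for the formal normality check.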

Fit GMM with 1 component (null model)

Fit GMM with 2 components (alternative model)

Fit GMM with 3 components (for BIC comparison)
Statistic         Value      Comment
Mann-Whitney U    316093.5   p < .001: groups differ
Cohen’s d         0.3        moderate/small
ΔBIC (2−1)        2381.1     strong support for 2 modes
LRT p-value       0.0        prefer 2-component
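For readers who want to see how a ΔBIC and an LRT p-value of this kind can be produced, here is a self-contained sketch that fits the 1- and 2-component models to simulated scores with a hand-written EM loop in base R. It is not the code that generated the table: `fit_gmm2()` is a hypothetical helper, the starting values are arbitrary, and the sign convention (standard BIC, lower is better) is an assumption. Some packages, mclust for example, report BIC with the opposite sign so that larger is better.

```r
# Simulated pooled scores drawn from two overlapping normal components
set.seed(7)
x <- c(rnorm(700, mean = 32, sd = 4), rnorm(300, mean = 45, sd = 8))
n <- length(x)

# 1-component model: the MLE is the sample mean and the (biased) sample SD
sd_mle <- sqrt(mean((x - mean(x))^2))
loglik1 <- sum(dnorm(x, mean(x), sd_mle, log = TRUE))

# 2-component model with unequal variances, fitted by a basic EM loop
fit_gmm2 <- function(x, max_iter = 500, tol = 1e-8) {
  w <- c(0.5, 0.5)
  mu <- unname(quantile(x, c(0.25, 0.75)))  # crude starting values
  sigma <- rep(sd(x), 2)
  weighted_dens <- function() {
    cbind(w[1] * dnorm(x, mu[1], sigma[1]), w[2] * dnorm(x, mu[2], sigma[2]))
  }
  loglik <- sum(log(rowSums(weighted_dens())))
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability of each component for each observation
    dens <- weighted_dens()
    resp <- dens / rowSums(dens)
    # M-step: weighted updates of mixing weights, means, and SDs
    nk <- colSums(resp)
    w <- nk / length(x)
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
    loglik_new <- sum(log(rowSums(weighted_dens())))
    converged <- abs(loglik_new - loglik) < tol
    loglik <- loglik_new
    if (converged) break
  }
  list(weights = w, means = mu, sds = sigma, loglik = loglik)
}

fit2 <- fit_gmm2(x)

# BIC = -2 * logLik + (number of free parameters) * log(n); lower is better
bic1 <- -2 * loglik1 + 2 * log(n)       # mean, SD
bic2 <- -2 * fit2$loglik + 5 * log(n)   # 2 means, 2 SDs, 1 mixing weight
delta_bic <- bic1 - bic2                # positive values favor two components

# Naive likelihood-ratio test with 3 extra parameters
lrt_stat <- 2 * (fit2$loglik - loglik1)
lrt_p <- pchisq(lrt_stat, df = 3, lower.tail = FALSE)

cat("Component means:", round(fit2$means, 1), "\n")
cat("Delta BIC (1 - 2):", round(delta_bic, 1), "| LRT p:", signif(lrt_p, 3), "\n")
```

The chi-square reference distribution is only a rough guide here, because the usual regularity conditions for the LRT do not hold when testing the number of mixture components; a parametric bootstrap is the more defensible check. A preferred two-component fit also does not by itself imply a visibly bimodal histogram, since a skewed unimodal distribution can be fitted well by two overlapping normals.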

Plot breakdown: Youth who develop BD-I by the 6-year follow-up (“Yes,” red) show a modest rightward shift in baseline Sleep Disturbance relative to those who do not (“No,” blue). A Mann–Whitney U test confirms that the groups differ (p < .001), with Cohen’s d ≈ 0.3. Fitting a two-component Gaussian mixture yields two component means (dashed lines) that approximate the natural 1-D k-means centroids, showing that Sleep Disturbance alone yields a somewhat intuitive univariate clustering boundary.

Answer to Q1: Sleep Disturbance Univariate Separability

Formal group‐difference test:

  • Mann–Whitney U p < .001 confirms the “Yes” vs “No” groups differ on baseline Sleep Disturbance

  • Cohen’s d ≈ 0.3 indicates a small‐to‐moderate mean shift, matching the partial overlap in the histograms

Normal vs two‐normal comparison:

  • A two-component Gaussian mixture is strongly preferred over a single normal (ΔBIC ≈ 2381; LRT p ≈ 0), which supports two underlying components in the pooled scores

  • Those component means (dashed lines) lie close to where a 1-D k-means (k=2) would place its centroids

Clustering interpretation in 1-D:

  • In one dimension, k-means centroids sit near the two modes, and cases are classified by the midpoint between them.

  • Thus Sleep Disturbance alone already yields a natural 2-cluster solution. With d ≈ 0.3, however, the “No” and “Yes” distributions overlap substantially, so on its own it is only a weak univariate classifier.
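To make the link between effect size and classification concrete, the sketch below assumes two equal-variance normal groups separated by d = 0.3. That is an idealization, not the observed Sleep Disturbance distribution, and the simulated vector `x` and its group sizes are hypothetical. It computes the discrimination implied by d and then checks that 1-D k-means with k = 2 classifies by the midpoint between its centroids.

```r
d <- 0.3  # standardized mean difference reported in the text

# Under two equal-variance normals, d maps directly onto discrimination
auc <- pnorm(d / sqrt(2))     # chance a random "Yes" outscores a random "No"
balanced_acc <- pnorm(d / 2)  # sensitivity = specificity at the midpoint cut
cat("AUC:", round(auc, 3), "| balanced accuracy:", round(balanced_acc, 3), "\n")

# 1-D k-means (k = 2) on simulated standardized scores
set.seed(123)
x <- c(rnorm(900, mean = 0), rnorm(100, mean = d))
km <- kmeans(x, centers = 2, nstart = 10, iter.max = 100)
centroids <- sort(as.numeric(km$centers))
boundary <- mean(centroids)   # midpoint between the two centroids

# Share of points whose cluster matches "below vs above the midpoint"
lower_cluster <- which.min(km$centers)
agree <- mean((x < boundary) == (km$cluster == lower_cluster))
cat("Centroids:", round(centroids, 2), "| boundary:", round(boundary, 2),
    "| agreement with midpoint rule:", agree, "\n")
```

Under these assumptions d = 0.3 corresponds to an AUC of roughly 0.58, which is only slightly better than chance. Note also that k-means will split any 1-D sample into two halves around a midpoint, whether or not those halves line up with the outcome groups.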

2. Feature Ranking by Univariate Effect Size

Goal

Identify which baseline risk features show at least a medium univariate association with our key outcome (here, bipolar_I at 6 years), so we know which variables carry the strongest marginal signal. We still retain all 16 features in the master set; this ranking only tells us which deserve a closer look first.

Why bipolar_I @ 6 Y?

This is our longest‐term clinical endpoint with the largest sample and best data completeness. A variable that shows even a medium effect here is a strong candidate for driving cluster structure.

Steps & Rationale

  1. Compute Cohen’s d for each baseline risk variable, comparing “No” vs “Yes.”
– Flag |d| > 0.5 as a medium effect and |d| > 0.8 as a large effect

  2. Rank all features by |d|

  3. Short-list those with |d| > 0.5 for further 2-D checks (but keep the full set for clustering)

Note: the same pattern (|d| for binary outcomes, η² for CBCL quintile groups) can be applied to any other outcome, including continuous outcomes converted into quintile groups.
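The ranking steps above can be sketched in base R as follows. The data frame `toy`, its columns, and the helpers `cohens_d()` and `label_effect()` are hypothetical, and the sign convention ("No" minus "Yes", so that a higher score in the "Yes" group gives a negative d) is my assumption. The "negligible" label below 0.2 follows Cohen's conventional benchmarks and is an addition to the two thresholds listed in the steps.

```r
# Hypothetical toy data: three risk features and a binary outcome
set.seed(2024)
n <- 400
outcome <- rbinom(n, 1, 0.25)
toy <- data.frame(
  outcome = outcome,
  feature_a = rnorm(n, mean = 0.6 * outcome),
  feature_b = rnorm(n, mean = 0.3 * outcome),
  feature_c = rnorm(n)
)

# Cohen's d ("No" minus "Yes") using the pooled SD, ignoring missing values
cohens_d <- function(x, g) {
  x_no <- x[which(g == 0 & !is.na(x))]
  x_yes <- x[which(g == 1 & !is.na(x))]
  s_pooled <- sqrt(((length(x_no) - 1) * var(x_no) +
                    (length(x_yes) - 1) * var(x_yes)) /
                   (length(x_no) + length(x_yes) - 2))
  (mean(x_no) - mean(x_yes)) / s_pooled
}

# Map |d| onto verbal labels using the 0.2 / 0.5 / 0.8 benchmarks
label_effect <- function(abs_d) {
  cut(abs_d, breaks = c(0, 0.2, 0.5, 0.8, Inf),
      labels = c("negligible", "small", "medium", "large"), right = FALSE)
}

features <- setdiff(names(toy), "outcome")
d_values <- vapply(features, function(f) cohens_d(toy[[f]], toy$outcome), numeric(1))

ranking <- data.frame(
  feature = features,
  d = round(d_values, 2),
  abs_d = round(abs(d_values), 2),
  row.names = NULL
)
ranking$effect <- label_effect(abs(d_values))
ranking <- ranking[order(ranking$abs_d, decreasing = TRUE), ]
print(ranking)

# Short-list for follow-up 2-D checks; the full feature set is still retained
shortlist <- ranking$feature[ranking$abs_d > 0.5]
print(shortlist)
```

For binary risk features such as family history or bullying, Cohen's d is still computable on the 0/1 coding, but it depends on the base rate, so an odds ratio or a phi coefficient may be worth reporting alongside it.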

Risk Variable          Cohen’s d   |d|    Effect Size
CBCL Depression T      -0.52       0.52   medium
ACE Index              -0.39       0.39   small
Sleep Disturbance      -0.30       0.30   small
CBCL Anxiety T         -0.29       0.29   small
CBCL Attention T       -0.26       0.26   small
Fam Hx Dep             -0.24       0.24   small
CBCL Aggression T      -0.23       0.23   small
GBI Mania Score        -0.22       0.22   small
UPPS Neg Urgency       -0.22       0.22   small
UPPS Pos Urgency       -0.16       0.16   small
Bullying               -0.15       0.15   small
NIHTB Flanker          -0.14       0.14   small
NIHTB Working Mem      -0.09       0.09   small
NIHTB Proc Speed       -0.07       0.07   small
Fam Hx Mania           -0.07       0.07   small
Child Opportunity Z    -0.02       0.02   small
Neighborhood Safety    -0.02       0.02   small

Only baseline CBCL Depression T shows a medium univariate effect (|d| = 0.52) in its association with Bipolar I at the 6-year follow-up; every other baseline risk factor falls in the small range or below (|d| from 0.39 down to 0.02), with ACE Index (|d| = 0.39) and Sleep Disturbance (|d| = 0.30) the next strongest.

This tells me two things:

  1. No single predictor can strongly separate “No” vs “Yes” cases on its own; every variable carries only modest signal.

  2. A multivariate approach such as clustering in the full 16-dimensional risk space can plausibly harness the combined power of these small-effect variables (and their interactions) to form more discriminative risk groupings than any one variable alone, and the bi-plots above suggest that this may be possible to some degree.
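As a rough back-of-the-envelope check on the second point, the sketch below combines the |d| values from the ranking table under the strong, and almost certainly unrealistic, assumption that the features are uncorrelated within outcome groups. In that idealized case the multivariate standardized (Mahalanobis) distance between the group means is the square root of the sum of squared d values.

```r
# |d| values copied from the ranking table in this section
abs_d <- c(0.52, 0.39, 0.30, 0.29, 0.26, 0.24, 0.23, 0.22, 0.22,
           0.16, 0.15, 0.14, 0.09, 0.07, 0.07, 0.02, 0.02)

# Assumption: features uncorrelated within groups, so squared effects add up
combined_d <- sqrt(sum(abs_d^2))
cat("Largest single |d|:", max(abs_d), "\n")
cat("Combined distance under independence:", round(combined_d, 2), "\n")
```

Under that assumption the combined separation is about 0.98, close to twice the best single feature. The real figure depends on the correlations among the features (the CBCL scales, for instance, are likely to be correlated with one another), and an unsupervised method such as k-means is not guaranteed to find clusters aligned with the direction that separates the outcome groups, so this number is best read as an illustration of why pooling small effects can help rather than as an estimate.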